Abstract — The ability to predict the intentions of people based
solely on their visual actions is a skill performed only by humans and
animals. This requires segmentation of items in the field of view,
tracking of moving objects, identifying the importance of each object,
determining the current role of each important object individually and
in collaboration with other objects, relating these objects into a
predefined scenario, assessing the selected scenario with the
information retrieved, and finally adjusting the scenario to better fit the
data. This is all accomplished with great accuracy in less than a few
seconds.

The  intelligence  of  current  computer  algorithms  has  not  reached 
this  level  of  complexity  with  the  accuracy  and  time  constraints  that 
humans  and  animals  have,  but  there  are  several  research  efforts  that 
are  working  towards  this  by  identifying  new  algorithms  for  solving 
parts  of  this  problem. 

This  survey  paper  lists  several  of  these  efforts  that  rely  mainly  on 
understanding  the  image  processing  and  classification  of  a  limited 
number  of  actions.  It  divides  the  activities  up  into  several  groups  and 
ends  with  a  discussion  of  future  needs. 

Keywords —  Activity  recognition,  artificial  intelligence,  visual  human 
intent  analysis,  visual  understanding. 


I.  Introduction 

Visual Human Action Recognition (VHAR) is an
important and complex artificial and computational intelligence
process used to help solve many understanding problems. If a
system were developed to recognize violent activity in a
subway station [13], then VHAR would be used to identify the
actions of each person in the field of view. These
actions would be combined through other algorithms to
determine the hostility of people and possibly warn operators of a
violent action occurring. Despite the need for VHAR systems,
research in this area has not reached the level of
maturity desired, unlike several other areas of artificial and
computational intelligence, which are very mature in
development. Providing the framework for a robust VHAR
algorithm is difficult given all the different
types of actions the algorithm has to be able to
handle. People performing the same action move in a variety
of different ways, thus creating a non-deterministic class of
actions; that is, there is no specific or deterministic way people
move for a particular action. Even a single action performed
by the same person varies from one repetition to the next.

This paper highlights several techniques used in
VHAR research, including non-traditional artificial
intelligence techniques, visual languages, statistical
algorithms, and others. It is meant to be a survey of
the current research being performed in the area of VHAR.
The goal of this paper is to 1) educate on VHAR by showing
a wide variety of innovative methods used to improve
specific areas of classification related to VHAR, and 2)
show how open the field of research is to new techniques that
will improve VHAR classification systems. The
papers referenced here represent a fairly complete
list of ideas used to advance the area of VHAR and related
fields of research.

II.  NON-TRADITIONAL  TECHNIQUES 

A large amount of research in VHAR uses visual cues of
human actions without any traditional artificial or
computational intelligence techniques [10, 12, 15, 17, 23, 25,
27, 31, 33, 34, 37, 43, 46, 47, 49, 50, 52, 55, 61, 62, 63].
These algorithms rely on simplicity at the cost of fusing input
data. They often use less typical data inputs; that is,
inputs that would not necessarily be used by human observers.
They rely almost exclusively on the pre-processing of the data
while using statistical or non-traditional artificial and
computational intelligence algorithms to determine the
behavior. M. Cristani et al. [15] use both audio and visual
data to determine simple events in an office. First they
remove foreground objects and segment the images in the
sequence. This output is coupled with the audio data and a
threshold detection process is used to identify unusual events.
The fused event sequences are then put into an audio-visual
concurrence (AVC) matrix to compare with known AVC
events. Because the event patterns in the AVC matrix are known,
they determine the classification of the action.

Many  research  projects  in  this  area  use  their  own  form  of 
plotting  space-time  data  from  the  image  sequences  and 
calculating  the  closest  distance  to  pre-determined  or 
automatically  determined  events  to  decide  what  action  the 
human  is  performing  [10,  12,  16,  17,  19,  23,  25,  27,  30,  31, 
34, 35, 43, 50, 55, 59]. M. Dimitrijevic et al. [17] developed a
template database of actions based on five male and three
female subjects. Each human action is represented by three frames
of the 2D silhouette: the frame when the person first touches
the ground with one foot, the frame at the midstride
of the step, and the end frame when the person finishes
touching the ground with the same foot. The three-frame sets
were taken from seven camera positions. To classify the
action, they use a modified chamfer distance calculation to
match against the template sequences in the database.
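
The template-matching idea can be illustrated with a minimal sketch of a plain symmetric chamfer distance between two sets of silhouette edge points. This is not the modified measure used in [17]; the function names, the labels, and the toy point sets are assumptions made only for illustration.

```python
import math

def directed_chamfer(points_a, points_b):
    """Mean distance from each point in points_a to its nearest point in points_b."""
    total = 0.0
    for ax, ay in points_a:
        total += min(math.hypot(ax - bx, ay - by) for bx, by in points_b)
    return total / len(points_a)

def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance: average of the two directed distances."""
    return 0.5 * (directed_chamfer(points_a, points_b)
                  + directed_chamfer(points_b, points_a))

def best_template(query, templates):
    """Return the label of the stored template closest to the query silhouette."""
    return min(templates, key=lambda label: chamfer_distance(query, templates[label]))

# Hypothetical edge points (x, y) for two stored key-frame silhouettes.
templates = {
    "heel_strike": [(0, 0), (1, 0), (2, 1), (3, 3)],
    "midstride": [(0, 0), (0, 1), (0, 2), (0, 3)],
}
query = [(0, 0), (1, 0), (2, 1), (3, 2)]
print(best_template(query, templates))
```

A practical system would typically precompute a distance transform of each template so that every query point is scored by a single lookup rather than a search over all template points.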

D. Weinland et al. [10] use motion history volumes to
determine human gestures by extending the 2D pixel
representation with time to a 3D representation with time.
This is accomplished by using multiple cameras around the
person and subtracting out any background information.
Classes are created manually for each action or gesture.
Mahalanobis distance with principal component analysis is
used to identify the action from the appropriate class.

The work from A. Mokhber et al. [23] also uses
volumetric models to determine human actions. Features
representative of the action sequence are extracted from the
binary images and used to compute the space-time volumes,
so that each person is described globally. The volumes are formed
from all the binary images in the action, concatenated
in chronological order. Moments of these volumes make up the
feature vector. Test data are assigned to the
classes formed from the training data by Mahalanobis
distance.
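
As a minimal sketch of classification by Mahalanobis distance, the following assumes two-dimensional feature vectors so that the covariance inverse can be written in closed form. The class names and sample values are hypothetical, and the dimensionality reduction step used in [10] is omitted.

```python
def mean_and_covariance(samples):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of (x, y) feature pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 matrix inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def classify_mahalanobis(point, classes):
    """Assign the point to the class with the smallest Mahalanobis distance."""
    stats = {label: mean_and_covariance(s) for label, s in classes.items()}
    return min(stats, key=lambda label: mahalanobis_sq(point, *stats[label]))

# Hypothetical training features (for example, two moments per action sample).
classes = {
    "wave": [(1.0, 2.0), (1.2, 2.1), (0.8, 1.8), (1.1, 2.3)],
    "kick": [(4.0, 0.5), (4.2, 0.3), (3.8, 0.6), (4.1, 0.9)],
}
print(classify_mahalanobis((1.0, 2.1), classes))
```

Unlike Euclidean distance, this measure scales each direction by the spread of the training class, so a class with widely varying samples tolerates larger deviations.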

M. Rahman et al. [25] use an eigenspace to categorize actions,
identifying events from distance computations. They argue that
eigenspace methods should be used because they are
highly mathematical and require less image processing.
Motion from a camera is captured, manually placed into
classes, and finally assembled into a covariance matrix. This makes
up the universal image set for that action. Eigenvalues and
eigenvectors are calculated, and the Karhunen-Loeve
technique is used to keep the ones that best describe the action. These
make up an orthogonal coordinate system. To recognize the
behavior in an unknown image sequence, the sequence is projected
onto this coordinate system and a distance measure is applied.
The study looked at five different cricket
events with fairly good recognition.

Blackburn and Ribeiro [27] use manifolds from
isometric feature mapping (Isomap) to represent individual
images of motion sequences. Isomap is used to reduce the
dimensionality of the image while keeping most of the features
required  for  classification.  Scores  for  the  manifolds  are 
calculated  and  the  curves  are  either  stored  (in  training)  or 
compared  against  classified  events  using  Dynamic  Time 
Warping  (DTW).  Classification  is  by  nearest  neighbor. 
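
The comparison step can be sketched with the textbook dynamic time warping recurrence followed by a nearest-neighbor decision. The one-dimensional score curves, the labels, and the function names below are assumptions for illustration, not the data or code of [27].

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D score curves."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def nearest_neighbor(query, labeled_curves):
    """Return the label of the stored curve with the smallest DTW cost."""
    return min(labeled_curves, key=lambda item: dtw_distance(query, item[1]))[0]

# Hypothetical stored curves and a time-stretched query.
training = [
    ("walk", [0.0, 1.0, 0.0, -1.0, 0.0]),
    ("jump", [0.0, 2.0, 4.0, 2.0, 0.0]),
]
query = [0.0, 0.0, 1.0, 1.0, 0.0, -1.0, 0.0]
print(nearest_neighbor(query, training))
```

Because the warping path may repeat samples, the stretched query aligns with the stored "walk" curve at zero cost even though the two sequences differ in length.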

III. TRADITIONAL ARTIFICIAL INTELLIGENCE
TECHNIQUES

Some of the traditional artificial and computational
intelligence techniques are used for classifying human action,
many with a spin towards specific motions [13, 19, 20, 26, 28,
35, 40, 42, 45, 58, 63]. J.-Y. Yang et al. [19] use neural
networks to determine human actions. They reduce the errors
associated with normal human motion capture by placing
tags on body parts for tracking. They also strap a tri-axial
accelerometer to the subject’s wrist to monitor three degrees
of motion on that body part. Tri-axial data is captured
at pre-determined time intervals. This data is the input to a
neural network specifically designed to determine whether the event
is static, like standing or sitting, or dynamic, like
walking and running. Once the event is determined to be
either static or dynamic, a second neural network, either a
static event network or a dynamic event network, is applied to the
same data and the action is classified. The results
are promising for the limited actions the system is designed to
detect: standing, sitting, walking, running, vacuuming,
scrubbing, brushing teeth, and working on a computer.

Rule based and fuzzy systems are a few other common
types of artificial and computational intelligence techniques used
to identify patterns and have been adapted to analyze human
events [13, 16, 47]. H. Stern et al. [16] created a prototype
fuzzy system for picture understanding of surveillance
cameras. Their model is split into three parts: a pre-processing
module, a static object fuzzy system module, and a dynamic
temporal fuzzy system module. The static fuzzy system
module takes in the pre-processed data and outputs the number
of people involved in the scene: a single person, two people,
three people, many people, or no people. The dynamic fuzzy
system determines the intent of the person, or people, based on
their global temporal movements. Although this requires only
a basic understanding of human intent by using global
movements of people and their interactions based on global
positions, it is included in many application research programs
within the U.S. Department of Defense: Near Autonomous
Unmanned System [84], Army Research Lab Collaborative
Technology Alliance [85], and Mobile Detection Assessment
and Response System [86].
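
To make the idea of a static fuzzy module concrete, the sketch below maps one hypothetical input, the ratio of foreground area to the nominal area of one person, onto the five output categories named above. The input feature, the membership shapes, and their breakpoints are all assumptions; the actual inputs and rules of [16] are not described here.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with support (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def rising_shoulder(x, start, full):
    """Membership that ramps from 0 at start to 1 at full and stays at 1."""
    if x <= start:
        return 0.0
    if x >= full:
        return 1.0
    return (x - start) / (full - start)

def crowd_memberships(area_ratio):
    """Degree of membership of the input in each crowd-size category."""
    return {
        "no people": triangular(area_ratio, -1.0, 0.0, 1.0),
        "single person": triangular(area_ratio, 0.0, 1.0, 2.0),
        "two people": triangular(area_ratio, 1.0, 2.0, 3.0),
        "three people": triangular(area_ratio, 2.0, 3.0, 4.0),
        "many people": rising_shoulder(area_ratio, 3.0, 4.0),
    }

def crowd_label(area_ratio):
    """Defuzzify by picking the category with the highest membership."""
    memberships = crowd_memberships(area_ratio)
    return max(memberships, key=memberships.get)

print(crowd_label(1.8))
```

An input of 1.8 belongs partly to "single person" and mostly to "two people", which is the kind of graded decision that distinguishes a fuzzy module from a hard threshold.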


IV.  MARKOV  MODELS  AND  BAYESIAN  NETWORKS 

Developing Markov models or Bayesian networks is a
common approach to VHAR research. This research path
fits the logical approach of having a sequence of images
make up an action. Each image in the sequence is examined
together with the image that follows it, similar to how a human
recognizes actions. There are several ways to develop a
network from the input data [1, 8, 14, 18, 19, 30, 36, 38, 42,
46, 48, 57]. Du, Chen, and Xu [1] use a
Coupled Hierarchical Durational-State Dynamic Bayesian
Network (CHDS-DBN) to model human actions. They claim
that to understand human actions, frameworks should capture
both the motion corresponding to the interaction and details
of the motion on different scales. For the most part, prior research
has not included interaction with other people as a determinant
in understanding the intent of a single person; most work is on
the motion characteristics of the individual alone. This approach
adds a decision base to normal action identification.

The work by A. Galata et al. [8] uses variable length
Markov models for human action recognition. They claim
that variable length Markov models provide a more
efficient and more flexible way to represent behaviors than
other common classification models for input data on large
temporal scales.
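
The core idea of a variable length Markov model, using a long context when it has been seen often enough and falling back to a shorter one otherwise, can be sketched as follows. This is a simplified illustration and not the learning procedure of [8]: the `min_count` rule stands in for the statistical pruning a real model would use, and the pose labels are hypothetical.

```python
from collections import Counter, defaultdict

def train_vlmm(sequence, max_order):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(sequence):
        for order in range(max_order + 1):
            if i - order < 0:
                break
            context = tuple(sequence[i - order:i])
            counts[context][symbol] += 1
    return counts

def predict_next(counts, history, max_order, min_count=2):
    """Predict the next symbol from the longest sufficiently observed context."""
    for order in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - order:])
        seen = counts.get(context)
        if seen and sum(seen.values()) >= min_count:
            return seen.most_common(1)[0][0]
    return None

# Hypothetical sequence of quantized pose labels.
poses = ["stand", "step", "swing"] * 3
model = train_vlmm(poses, max_order=2)
print(predict_next(model, ["stand", "step"], max_order=2))
```

Storing only the contexts that actually occur is what keeps the memory cost below that of a fixed high-order Markov chain over the same alphabet.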


V.  GRAMMARS 

In constructing networks, much VHAR research uses
grammars [5, 22, 37, 41, 44, 50, 52, 56, 57, 58, 59, 74] to best
describe the sequence of events the body makes when
determining the actions of the person visually. Grammars are
mathematically based and seem to fit well with visual action
understanding due to their network fashion of solving
problems. A. Ogale et al. [22] use probabilistic context-free
grammars (PCFGs) on short action sequences of a person from
video. Body poses are stored as silhouettes, which are used in
the construction of the PCFG. Pairs of frames are constructed
based on their time slot: the body poses from frames 1 and 2 are
paired, the body poses from frames 2 and 3 are paired, and so
on. These pairs construct the PCFG for the given action.
When testing the algorithm, the same procedure is followed.
Comparing the testing data with the trained data is
accomplished through Bayes’ rule: $P(s_k \mid p_i) = P(p_i \mid s_k) P(s_k) / P(p_i)$,
where $s_k$ is the $k$th silhouette and $p_i$ is the $i$th pose.
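
A small numerical sketch of this rule is given below. The prior and likelihood values are invented for illustration, and the evidence term for the pose is obtained by summing the joint probabilities over all silhouettes.

```python
def posterior(prior, likelihood, pose):
    """Posterior probability of each silhouette given an observed pose."""
    # Joint term: P(pose | silhouette) * P(silhouette).
    joint = {s: likelihood[s][pose] * prior[s] for s in prior}
    evidence = sum(joint.values())  # P(pose), by the law of total probability
    return {s: joint[s] / evidence for s in joint}

# Hypothetical values: three silhouettes, two poses.
prior = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
likelihood = {
    "s1": {"p1": 0.1, "p2": 0.9},
    "s2": {"p1": 0.6, "p2": 0.4},
    "s3": {"p1": 0.5, "p2": 0.5},
}
post = posterior(prior, likelihood, "p1")
print(max(post, key=post.get))
```

Here the silhouette with the largest prior is not the most probable one after the pose is observed, because its likelihood for that pose is low.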

Starner, Weaver, and Pentland [5] also use
grammars in constructing their network. Phrase grammars
were  used  to  distinguish  the  type  of  action  from  hand  signals 
that  can  be  networked  together  to  form  the  meaning  of  a 
sentence  signed  using  American  Sign  Language.  In  this  case, 
phrase  grammars  limit  the  search  set  of  words  to  improve  the 
accuracy  of  what  is  being  described.  They  also  speed  up  the 
process  over  not  using  grammars.  A  Hidden  Markov  Model  is 
used  to  train  and  test  the  data. 


VI.  TRADITIONAL  HIDDEN  MARKOV  MODELS 

Of all the visual human action recognition networks
constructed, Hidden Markov Models (HMMs) are the most
widely used [3, 4, 5, 7, 8, 9, 11, 20, 21, 24, 26, 38, 45, 48, 51,
64, 80]. Hidden Markov Models keep a network of body
poses related to each other and provide a way of learning
parameters that best fit a set of training data with known
classifications. Yamato et al. [3] use HMMs to recognize six
tennis strokes with a 25x25 mesh feature matrix to describe
body positions in each frame. Campbell, Becker, and
Azarbayejani [4] used HMMs to recognize eighteen Tai Chi
moves. Each move was represented by a series of vectors
formed by the 3D positions of the head and the hands. Yu and
Ballard [9] use HMMs to distinguish similar actions based on
head and eye movements. The work of Lee and Kim [7]
shows how to use HMMs for gesture recognition. A threshold
model is built to provide dynamic threshold values to
distinguish between meaningful and meaningless gestures. If
the action is within a pre-defined threshold then it is
considered a meaningful action. HMMs are used to train and
test the predefined gestures. Gehrig and Schulz [24] used
HMMs to recognize ten kitchen actions based on the
movement of twenty-four points on the upper body. They
looked at skeletal data, calculated the correct movements
of people, and reduced the number of tracked body points to
thirteen with similar results. Gao et al. [26] used both Optical
Flow Tensors (OFT) and HMMs to distinguish basketball shot
actions from video. Optical flow fields are modeled from the
video frames at several resolutions and a tensor, the OFT, is
built. The dimensionality of the data is
reduced by first applying a General Tensor Discriminant
Analysis function and then a Linear Discriminant Analysis
function. An HMM is used to train and test the final features.
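
The way an HMM-based recognizer assigns a label can be sketched with the forward algorithm: each action has its own model, and an observed sequence of codebook symbols is given to the model with the highest likelihood. The two models below use hand-set, hypothetical parameters; in the surveyed work the parameters are learned from training data.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM."""
    n_states = len(start)
    # alpha[s] is the probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)

def classify_sequence(obs, models):
    """Return the name of the model that best explains the observations."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))

# Hypothetical two-state left-to-right models over codebook symbols 0, 1, 2.
left_to_right = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "forehand": ([1.0, 0.0], left_to_right, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "backhand": ([1.0, 0.0], left_to_right, [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]),
}
print(classify_sequence([0, 0, 1, 1], models))
```

For long sequences the forward variables underflow, so practical implementations rescale them at each step or work with log probabilities.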


VII.  NON-TRADITIONAL  HIDDEN  MARKOV  MODELS 

Other  forms  of  HMMs  have  been  developed  to  handle 
more  specific  problems  associated  with  HMM  based  action 
recognition  systems  [2,  6,  11,  14,  18,  20,  21,  28,  39,  51,  54, 
60,  66,  70,  76,  77,  79,  81].  Wilson  and  Bobick  [6]  use  a 
Parametric  Hidden  Markov  Model  (PHMM)  to  recognize 
gestures.  The  PHMM  has  an  additional  parameter  used  to 
represent  meaningful  variations  of  gestures  across  the  set  of  all 
gestures.  This  gives  PHMMs  the  ability  to  distinguish 
between  gesture  meanings  with  similar  hand  movements. 

Oliver et al. [11] developed a real-time system that detects
and classifies interactions between people using a Coupled
Hidden Markov Model (CHMM). They used synthetic
environments to model person-to-person interactions and thus
create their CHMM. Data from a static camera was used
and moving objects were segmented and tracked. Data
describing the location, heading, and relative location to other
people were input into the synthetically created CHMM for
analysis and classification of the interaction type. Results
show the CHMM outperformed standard HMMs. This is not
surprising, since standard HMMs work on single automatons
whereas CHMMs work on coupled automatons, so standard HMMs
cannot be expected to outperform CHMMs in this particular environment.

Multi-Observation Hidden Markov Models (MOHMMs)
are discussed in both [28] and [14] from Xiang and Gong. In
[28] they use MOHMMs to create breakpoints in the video
content of an activity. Blobs above a certain threshold in each
frame are segmented from the pixel change history. Several
functions of these blobs are used in the feature vector to
classify the video with the MOHMM. In [14] an MOHMM
was used to detect piggybacking of people off someone else’s
security card to open a secured, card-access-only door.
Piggybacking is when someone follows another person
through a security door without using his/her own security card to
open it. The framework of the system allowed for continual
changes based on changes in people’s movements; thus
unsupervised learning is used to continually update the model.

Gong  and  Xiang  [18]  developed  a  Dynamic  Multi-Linked 
HMM  (DML-HMM)  to  recognize  group  activity  from  an 
outdoor  scene.  The  DML-HMM  is  based  on  salient  dynamic 
inter-linkages  among  multiple  temporal  events  using  Dynamic 
Probabilistic  Networks  (DPN).  Standard  HMMs  cannot  take 
into  account  the  multiple  processes  needed.  The  DML-HMM 
was  designed  to  handle  the  multitude  of  different  object 
events.  The  topology  is  determined  by  the  causality  and 
temporal  order,  which  was  automatically  made  using  the 
Schwarz  Bayesian  Information  Criterion  based  factorization. 
They claimed that instead of being fully connected like
Coupled HMMs (CHMMs), the DML-HMM aims to connect
only a subset of relevant hidden state variables across
multiple temporal processes. In a comparison with a
Multi-Observation HMM (MOHMM), a Parallel HMM
(PaHMM), and a CHMM, the DML-HMM performs better,
since the CHMM and the MOHMM propagate noise
through the system and the PaHMM discards correlations
between multiple temporal processes.

Continuous HMMs (cHMMs) are used in the work of
Antonakaki  et  al.  [20].  Their  work  classifies  abnormal 
behavior  of  people  based  on  both  their  short  term  behavior  and 
the  global  trajectory  of  each  subject.  A  short  term  behavior  is 
a behavior that can be classified in twenty-five frames, or one
second. A one-class support vector machine (SVM) is used to
distinguish abnormal behavior from the short term behavior
sequence. For trajectory data, a one-class cHMM is used to
determine if the person’s movement is abnormal. Both are
used to determine the final results.

Layered  Hidden  Markov  Models  (LHMMs)  are  used  in 
Oliver  et  al.  [21]  to  detect  specific  activities  in  an  office 
environment.  They  employ  a  two  level  cascade  of  HMMs 
with  three  processing  layers.  The  first  layer  captures  video, 
audio  and  keyboard/mouse  activity  to  create  the  first  level 
feature  vector.  The  middle  layer  has  two  HMMs,  one  for 
creating  an  audio  feature  vector  and  one  for  creating  a  video 
feature  vector.  The  top  layer  uses  the  results  of  these  HMMs 
along  with  keyboard/mouse  activity  and  the  derivative  of  the 
sound  localization  component  as  the  final  feature  vector.  The 
results  from  this  top  layer  determine  the  activity  in  the  office. 
They claim the LHMM makes it feasible to decouple different
levels of analysis for training and inference. A
single HMM would need a large parameter space and thus
a large amount of training data. Also, a single HMM would
not be robust enough to move to a different office without
retraining, which they claim the LHMM is.

Liu and Chua [2] use an Observation Decomposed Hidden
Markov Model (ODHMM) to model and classify multiple
people’s activities. They state that automatically recognizing
the activities of multiple agents (persons, extremities, or objects)
is very challenging
due to the complexity of interactions between agents. This
complexity stems from the large dimensionality of the feature
vectors and the complex mapping of agents from input data to
pre-defined activity models. To handle this problem, they
decompose each feature vector into a set of sub-feature vectors
for the ODHMM.

Del Rose et al. [70] developed the Evidence Feed
Forward HMM to better identify patterns in the observation
data for identification of actions. This diverges from
common HMM theory, which would normally assume that
observation-to-observation linkages upset the rule of causality
of the system. However, this is not the case, since these
linkages are viewed as another probability associated with the
closed system, and they offer better results than other HMMs.

VIII. SUMMARY OF VISUAL HUMAN INTENT
ANALYSIS DEVELOPMENT

Across  all  the  previous  research,  several  common 
requirements  stand  out. 

-  Detect  and  segment  objects  in  each  frame 

-  Determine  relative  position  and  orientation  of  each 
object 

-  Identify  a  meaningful  sequence  of  frames  from  the 
visual  input 

-  Store and retrieve past sequences of behaviors for
identification of current ones

To detect and segment out objects in each frame, there
needs to be a well developed set of image processing tools. If
the goal is to identify arm/hand movements, as in [5], to
classify American Sign Language words from visual
identification of the hands, then the image processing is very
important. Any misprocessing of the input data that goes into
the classifier may cause inaccuracy.

Determining  the  relative  position  and  orientation  of  the 
object  also  requires  good  image  processing  techniques.  In 
some  instances,  just  the  movement  is  important  and  not  the 
exact  location  from  frame  to  frame,  as  in  [11].  This  case 
requires  less  analysis  of  the  processed  input  data  into  the 
classifier  than,  say,  [10]  where  body  orientation  is  important  to 
match  with  preselected  human  actions  from  several  angles. 

To identify a meaningful sequence of frames, stops
should be placed before and after each interesting area. There
is a large amount of research just on finding separations in
actions. If, for example, HMMs are to be used and codebooks
are required to identify common actions, then having equal
length action sequences for comparison is important. This
would require either adding several inputs to the processed
sequence or removing parts in the middle that require little
attention, like slow movements. Either way, the intelligence
of the algorithm would have to be greatly increased, which
requires a lot of work to automate. In [17], three frames
that describe the action were taken from a small set. Automating
this process requires a lot of image processing to
correctly identify the starting, middle, and ending location of a
person’s stride.
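
One simple way to bring action sequences to a common length is uniform resampling by linear interpolation, sketched below for a one-dimensional feature sequence. This is a simpler alternative to the selective padding or removal of slow segments discussed above, not a method taken from the surveyed papers; the function name and sample values are assumptions.

```python
def resample(sequence, target_len):
    """Linearly interpolate a 1-D feature sequence to target_len samples."""
    if target_len < 2 or len(sequence) < 2:
        raise ValueError("need at least two input samples and two output samples")
    last = len(sequence) - 1
    out = []
    for k in range(target_len):
        pos = k * last / (target_len - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(sequence[lo] * (1.0 - frac) + sequence[hi] * frac)
    return out

# A short sequence is stretched and a long one is shortened.
print(resample([0.0, 2.0, 4.0], 5))
print(resample([0, 1, 2, 3, 4, 5, 6], 4))
```

Uniform resampling treats every part of the action as equally important, which is exactly the limitation that motivates more selective, and harder to automate, schemes.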

Identification of a sequence could also involve sequences that
have no visual information passed on, as in [27]. The authors
identified a way of processing the visual information to a point
where only a few values represent the sequence. The reader is
cautioned that the further the data is from its original
values, the more errors are introduced into the system.

Finally,  to  store  past  behaviors,  there  must  be  knowledge 
of  the  behavior.  In  intelligent  systems,  this  is  usually  done 
through  learning;  however,  it  can  also  be  done  through  human 
intervention. If human intervention is used to set up
parameters for known behaviors, then patterns that
would be found through the learning process are often missed,
thus causing misclassification of actions. It is suggested that a
combination of both computer learning and human interaction
be used. This requires heavy analysis of the training and
testing  data  to  completely  encompass  the  range  of  each 
activity  being  performed,  a  large  data  set  to  perform  several 
iterations  of  training  and  testing  of  data,  and  in-depth 
knowledge  of  the  minor  facets  of  the  action.  Once  the  user 
has  concluded  that  the  training  data  is  complete  and  the 
baseline  action  sequences  are  stored,  then  to  retrieve  them 
takes  nothing  more  than  a  comparison  of  a  newly  processed 
action  to  the  most  likely  candidate  stored. 

IX.  Conclusion 

Much of the human brain is set aside for processing the
visual sense. As computing power has continually increased,
and as an ever greater push is made for efficiencies in business and
government, letting automated computers perform tasks that were
previously human visual tasks could lead to great efficiencies and
improvements. Visual Human Intent Analysis (VHIA) is a
wide open field of research with several different methods that
relate to individual solutions, like identifying hand gestures,
understanding different tennis strokes, identifying office
activities, and many others. Each method described above
plays on the solution’s strengths and weaknesses, whether it is
simplified classification with heavy pre-processing or a more
intelligent decision system. The wide range of methods
demonstrates the openness to new and innovative solutions
that are tailored to one’s own problem.

However, across all the different techniques there are four
common tasks that every VHIA system must perform:
detect and segment objects in each frame, determine the relative
position and orientation of each object, identify a meaningful
sequence of frames from the visual input, and store and
retrieve past sequences of behaviors for identification of
current ones. These tasks are performed in different ways
depending on the type of classification system. They require a
lot of image processing, analysis of the input data, and
in-depth knowledge of the actions with respect to the processed
data.